feat: category-aware CC decision engine (Goal 3) #28
Open
bhuvan-somisetty wants to merge 1 commit into
Signed-off-by: bhuvan-somisetty <somisettybhuvan5@gmail.com>
Goal 1 detects what a sound is. Goal 2 detects whether anyone on screen reacted to it. But neither of those alone is enough to decide whether to add a caption - and that decision is the whole point of this tool.
This is Goal 3: the fusion engine that combines both signals and makes the call.
The core problem this solves
Without a decision engine, you have two bad options: caption everything (overcaption noise that viewers learn to ignore) or caption only sounds above some arbitrary single threshold (miss real events that had low audio confidence but strong visual reactions). Neither serves a hearing-impaired student watching a Hindi educational video.
The engine introduces a third approach: category-aware fusion, where the rules for firing a caption depend on what kind of sound was detected.
Decision logic — three tiers
HIGH_IMPACT events (gunshot, explosion, alarm, siren, glass breaking): a strong audio signal alone is sufficient. Visual reaction can rescue lower-confidence detections: if the camera reacts strongly to what might be a suppressed gunshot, that's worth a caption even if YAMNet only scored it 0.35.
AMBIENT events (music, rain, wind, traffic): these are the overcaption trap. A lesson with rain outside will trigger `[rain]` at every YAMNet frame. The engine gates ambient sounds on a minimum visual reaction score first, then requires the weighted combined score to clear a higher threshold. Music playing in the background with no visible reaction from anyone on screen - no caption.
GENERAL events (applause, crying, dog barking, tabla, firecrackers, etc.): audio leads at 60%, visual fills in the gap. A strong audio detection passes without any visual reaction. A borderline detection can still be accepted if the speaker clearly reacts.
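The three tiers above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the label sets and the 60% audio weight come from the description, while every threshold value, the function names `categorize` and `decide`, and the `Decision` stand-in for `CCDecision` are assumptions. Applying the same 60/40 weighting to AMBIENT events is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    HIGH_IMPACT = "high_impact"
    AMBIENT = "ambient"
    GENERAL = "general"


# Label sets taken from the PR description; anything else falls into GENERAL.
HIGH_IMPACT_LABELS = {"gunshot", "explosion", "alarm", "siren", "glass breaking"}
AMBIENT_LABELS = {"music", "rain", "wind", "traffic"}


@dataclass(frozen=True)
class Decision:
    """Hypothetical stand-in for the PR's CCDecision."""
    fire: bool
    reason: str


def categorize(label: str) -> Category:
    key = label.strip().lower()
    if key in HIGH_IMPACT_LABELS:
        return Category.HIGH_IMPACT
    if key in AMBIENT_LABELS:
        return Category.AMBIENT
    return Category.GENERAL


def decide(label, audio, visual,
           strong_audio=0.6, rescue_audio=0.3, strong_visual=0.7,
           ambient_gate=0.5, ambient_combined=0.7, general_combined=0.5,
           audio_weight=0.6):
    """Return a Decision for one sound event; all thresholds are assumed values."""
    combined = audio_weight * audio + (1.0 - audio_weight) * visual
    category = categorize(label)

    if category is Category.HIGH_IMPACT:
        # Strong audio alone is enough; a strong reaction can rescue weak audio.
        if audio >= strong_audio:
            return Decision(True, "strong audio alone is sufficient")
        if audio >= rescue_audio and visual >= strong_visual:
            return Decision(True, "weak audio rescued by strong visual reaction")
        return Decision(False, "audio too weak and no strong visual reaction")

    if category is Category.AMBIENT:
        # Reaction gate first, then a higher combined threshold.
        if visual < ambient_gate:
            return Decision(False, "ambient sound with no visible reaction")
        if combined >= ambient_combined:
            return Decision(True, "ambient sound cleared gate and combined threshold")
        return Decision(False, "ambient sound below combined threshold")

    # GENERAL: audio-led weighted score against a single threshold.
    if combined >= general_combined:
        return Decision(True, "combined score cleared threshold")
    return Decision(False, "combined score below threshold")


if __name__ == "__main__":
    for event in [("gunshot", 0.35, 0.8), ("music", 0.9, 0.0), ("tabla", 0.45, 0.8)]:
        print(event, decide(*event))
```

With these assumed defaults, a 0.35 gunshot with a strong reaction fires, loud music with no reaction does not, and a borderline tabla detection is accepted once the speaker visibly reacts.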
What each decision looks like
Every `CCDecision` carries a plain-English `reason` field explaining what happened.
The reason strings are designed to be useful for editors reviewing suggestions - they can see at a glance whether a rejection was due to weak audio, no visual reaction, or just not clearing the combined threshold.
How it connects to the rest of the pipeline
`AudioSignal` and `VisualSignal` are thin dataclasses defined in this module - the engine does not import from Goal 1 or Goal 2 directly, so it can be reviewed and merged independently.
`FusionConfig` exposes every threshold and weight as a dataclass field with documented defaults. Researchers who want to tune precision vs. recall for a specific content type can do so without touching the logic.
Tests
38 tests cover all three category paths, the ambient reaction gate, the combined-score boundary conditions, timestamp and label preservation, `batch_decide`, and custom `FusionConfig` overrides. India-specific labels ([tabla], [firecrackers]) have explicit test cases confirming they route through GENERAL rather than AMBIENT.
No ML dependencies. Runs with `pip install pytest && pytest tests/`.
Refs #2
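As an illustration of the custom `FusionConfig` overrides mentioned in the test summary, here is a minimal sketch of a config dataclass tuned toward precision. The class name `FusionConfigSketch`, its field names, and all default values are hypothetical; only the 60% audio weight comes from the description.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FusionConfigSketch:
    # Hypothetical field names and defaults, not the PR's actual values.
    audio_weight: float = 0.6           # audio share of the combined score
    general_threshold: float = 0.5      # combined score needed for GENERAL
    ambient_reaction_gate: float = 0.5  # minimum visual reaction for AMBIENT

    def __post_init__(self):
        # Reject values outside [0, 1] so a bad override fails early.
        for name in ("audio_weight", "general_threshold", "ambient_reaction_gate"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def combined_score(audio: float, visual: float, config: FusionConfigSketch) -> float:
    """Weighted sum of audio confidence and visual reaction."""
    return config.audio_weight * audio + (1.0 - config.audio_weight) * visual


default = FusionConfigSketch()
# Raise one threshold for a precision-oriented setup; other fields are unchanged.
high_precision = replace(default, general_threshold=0.65)

score = combined_score(0.9, 0.0, default)
accepted_by_default = score >= default.general_threshold
accepted_when_strict = score >= high_precision.general_threshold
```

Keeping the config frozen and deriving variants with `dataclasses.replace` means a tuned setup never mutates the shared defaults, which is one way to let the decision logic stay untouched while thresholds vary.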